Prosper Loan Data Visualization Project - Exploration

Table of Contents

Wrangling Data for Exploration

What is the structure of the dataset?

This dataset has 112,342 rows and 13 columns, with the following variables:

What is the main feature of interest looking at this dataset?

The main feature of interest is the borrower rate: which factors influence the interest rate a borrower receives when taking out a loan?

What features in the dataset will support the investigation into the main feature of interest?

To some degree, I expect most of these variables to have some effect on the interest rate a borrower is assigned. However, I suspect the borrower's credit score has the largest effect.

Univariate Exploration

In this histogram, most borrowers fall in the range of 4% to 36%, with two notable spikes: one around 15% and another around 30%. As for outliers, only 37 values fall below 4%.
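A minimal sketch of how a histogram like this could be drawn, assuming the rate is stored as a fraction (0.15 for 15%) in a column named `BorrowerRate`. The column name, the one-percentage-point bin width, and the synthetic data are assumptions for illustration only; they are not the notebook's actual code or results.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): the real analysis would use the Prosper dataframe.
rng = np.random.default_rng(0)
loans = pd.DataFrame({"BorrowerRate": rng.uniform(0.0, 0.36, size=1000)})

# 37 edges give 36 bins, each one percentage point wide, from 0% to 36%.
bin_edges = np.linspace(0.0, 0.36, 37)

fig, ax = plt.subplots(figsize=(8, 5))
counts, edges, _ = ax.hist(loans["BorrowerRate"], bins=bin_edges)
ax.set_xlabel("Borrower rate")
ax.set_ylabel("Number of loans")
ax.set_title("Distribution of borrower rate")
```

Narrow bins are what make spikes such as the ones around 15% and 30% visible; wider bins could smooth them away.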

Based on this plot, this is a variable I probably will not investigate any further.

For the most part, around 82% of loans are either current or completed, while the other 18% are past due, charged off, or defaulted. This doesn't quite explain the two spikes in interest rate. It is possible that a borrower with a bad credit history is staying current or has completed their loan agreement, or that a borrower with a good credit rating falls behind on their loan payments. However, if the loan is not in good standing, the loan company may raise the interest rate because the risk is higher.
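Shares like the 82% / 18% split can be computed directly from the status column. This is a sketch under assumptions: the column name `LoanStatus`, the status labels, and the counts below are made up so the example is self-contained (the counts were chosen to mirror the 82% figure, not taken from the data).

```python
import pandas as pd

# Made-up statuses and counts, purely for illustration.
status = pd.Series(
    ["Current"] * 50 + ["Completed"] * 32 + ["Chargedoff"] * 10
    + ["Defaulted"] * 5 + ["Past Due"] * 3,
    name="LoanStatus",
)

# Fraction of loans in each status; the fractions sum to 1.
shares = status.value_counts(normalize=True)

# Combined share of the loans in good standing.
good_share = shares[["Current", "Completed"]].sum()

print(shares)
print(f"Current or completed: {good_share:.0%}")
```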

This graph shows that most people (at least 90%) are employed in some fashion, so it might not explain the two spikes in interest rate.

It looks like most borrowers make at least $25,000. Income might have some correlation with interest rate, so it is worth plotting against interest rate.

Looking at credit scores, most borrowers have a score between 600 and 800, and I suspect there might be a correlation with interest rate. It could definitely be one factor explaining the two spikes in interest rate.

This shows that a little more than half of the borrowers are homeowners. Homeownership might affect the assigned interest rate, as there is less risk if a borrower is a homeowner.

Having a current loan that is delinquent would raise the assigned interest rate as there is a greater risk that the loan company may not get their money.

Usually, the loan company weighs risk by looking at factors such as employment and the borrower's assets before lending a given amount of money. It is possible that larger loans carry a lower interest rate than smaller loans.

Employment duration can be a factor in a borrower's ability to make payments. This could be a factor in assigning a borrower a certain interest rate.

Having multiple lines of credit might affect the interest rate, depending on whether the borrower is keeping up with payments on them. It does show credit history, which would lower the assigned interest rate.

If there are delinquencies, a loan company might not lend a person anything. However, if the organization does decide to give a borrower a loan, it would assign a higher interest rate. Looking at the linear and logarithmic versions of the plot, it is unclear how exactly delinquencies would affect interest rate.

A low debt-to-income ratio indicates less risk. A ratio of 1 means the borrower's gross income equals the total debt the borrower has to pay. A high debt-to-income ratio would affect the assigned interest rate. However, neither graph makes clear how exactly it would affect interest rate.

Discuss the distribution of your variable of interest. Were there any unusual points? Did you need to perform any transformations?

Looking at borrower rate, most borrowers fall between 4% and 36%. The highest interest rate is around 36% while the lowest is 0%. I created a new dataframe to investigate these outliers and sorted it by interest rate. Only 37 borrowers fall below 4% interest.
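The outlier check described above (filter into a new dataframe, then sort by rate) could look roughly like this. The column names and the handful of rows are hypothetical stand-ins, and the 4% threshold is written as 0.04 on the assumption that rates are stored as fractions.

```python
import pandas as pd

# Tiny made-up dataframe standing in for the loan data.
loans = pd.DataFrame({
    "BorrowerRate": [0.15, 0.00, 0.31, 0.035, 0.20, 0.01, 0.36],
    "LoanOriginalAmount": [4000, 1000, 3000, 15000, 7500, 2000, 2500],
})

# New dataframe holding only the low-rate outliers, sorted by interest rate.
low_rate = loans[loans["BorrowerRate"] < 0.04].sort_values("BorrowerRate")

print(len(low_rate), "loans below 4%")
print(low_rate)
```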

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the numerical distributions were right-skewed. In these instances, I plotted a logarithmic version alongside the original to see whether any patterns could be observed. In the case of loan amount, employment duration, and current credit lines, one shows a bell curve distribution while the other two show spikes at certain points. While a couple of variables were still right-skewed even after the logarithmic transformation, the picture might become clearer in bivariate analysis.
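One common way to put a linear and a logarithmic view side by side is to keep the data unchanged and use log-spaced bins on a log-scaled x-axis. The sketch below assumes that approach; the synthetic right-skewed values and the bin counts are illustrative assumptions, not the notebook's settings.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic right-skewed, strictly positive values (assumption for illustration).
rng = np.random.default_rng(1)
amounts = rng.lognormal(mean=8.5, sigma=0.7, size=2000)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(12, 4))

# Left panel: ordinary histogram on a linear axis.
ax_lin.hist(amounts, bins=40)
ax_lin.set_title("Linear scale")

# Right panel: bin edges evenly spaced in log space, drawn on a log axis.
log_edges = np.geomspace(amounts.min(), amounts.max(), 41)
ax_log.hist(amounts, bins=log_edges)
ax_log.set_xscale("log")
ax_log.set_title("Log scale")
```

A log axis cannot show zeros, so a variable with many zero values (a count of delinquencies, for example) would need separate handling before it is plotted this way.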

Bivariate Exploration

Now, I want to look at a scatterplot between the interest rate variable and the other numeric variables using a pair grid scatterplot.

These sets of scatterplots compare the numeric variables with interest rate using the sample dataframe.

The first row shows the number of days the loan is delinquent and the original loan amount versus interest rate.

The second row shows employment status duration and current credit lines versus interest rate.

The final row shows current delinquencies and debt-to-income ratio versus interest rate.
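The three rows described above could be built with seaborn's `PairGrid`, one grid per row, each plotting two numeric variables against interest rate. This is a sketch under assumptions: the column names, the sample size of 500, and the random data are all placeholders rather than the notebook's own values.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import seaborn as sns

# Synthetic stand-in data with assumed column names.
rng = np.random.default_rng(2)
n = 5000
loans = pd.DataFrame({
    "BorrowerRate": rng.uniform(0.04, 0.36, n),
    "LoanCurrentDaysDelinquent": rng.integers(0, 400, n),
    "LoanOriginalAmount": rng.integers(1000, 35000, n),
    "EmploymentStatusDuration": rng.integers(0, 400, n),
    "CurrentCreditLines": rng.integers(0, 30, n),
    "CurrentDelinquencies": rng.integers(0, 10, n),
    "DebtToIncomeRatio": rng.uniform(0, 1, n),
})

# Plot a random sample so the points do not overlap into a solid block.
sample = loans.sample(n=500, random_state=42)

# One list of x-variables per row of scatterplots.
x_rows = [
    ["LoanCurrentDaysDelinquent", "LoanOriginalAmount"],
    ["EmploymentStatusDuration", "CurrentCreditLines"],
    ["CurrentDelinquencies", "DebtToIncomeRatio"],
]

grids = []
for x_vars in x_rows:
    grid = sns.PairGrid(sample, x_vars=x_vars, y_vars=["BorrowerRate"], height=4)
    grid.map(sns.scatterplot, alpha=0.3)  # transparency helps with overplotting
    grids.append(grid)
```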

Three of these six scatterplots reveal a correlation. A positive correlation exists with current delinquencies (interest rate increases the more delinquencies the borrower has). Debt-to-income ratio also shows a positive correlation with interest rate. Finally, a negative correlation is present with the original loan amount: the more a borrower takes out, the lower the interest rate might be.
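Correlations read off scatterplots can be cross-checked numerically with Pearson coefficients. In the sketch below the data is synthetic and was deliberately constructed to have the signs described above, so it demonstrates the method only; the numbers it prints say nothing about the real Prosper data. Column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 2000

# Synthetic predictors (assumed column names).
delinquencies = rng.poisson(1.0, n)
debt_to_income = rng.uniform(0, 1, n)
amount = rng.uniform(1000, 35000, n)

# Synthetic rate, built on purpose to rise with delinquencies and
# debt-to-income ratio and to fall with loan amount, plus noise.
rate = (0.18 + 0.02 * delinquencies + 0.05 * debt_to_income
        - 2e-6 * amount + rng.normal(0, 0.02, n))

loans = pd.DataFrame({
    "BorrowerRate": rate,
    "CurrentDelinquencies": delinquencies,
    "DebtToIncomeRatio": debt_to_income,
    "LoanOriginalAmount": amount,
})

# Pearson correlation of every other column with the borrower rate.
corr_with_rate = loans.corr()["BorrowerRate"].drop("BorrowerRate")
print(corr_with_rate.round(2))
```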

These next sets look at the relationships among the other numerical variables to see if a correlation exists.

Current credit lines vs current delinquencies has a negative correlation. The rest of the graphs do not really show a correlation. I'll probably carry this further in multivariate analysis.

Three of the four graphs look similar. The only possible exception is debt-to-income ratio compared to current delinquencies, which might have some kind of correlation. Either way, I won't go further with these comparisons.

Among these three graphs, there is a slight negative correlation between current delinquencies and employment status duration. The other two don't really show a correlation. I'll carry current delinquencies vs employment status duration into multivariate analysis.

Out of these two, there is a correlation between current delinquencies and the original amount of the loan. I'll explore this relationship further in multivariate analysis.

No correlation is present comparing the original amount of the loan and how delinquent the current loan is. This comparison probably won't be explored further.

The next set of plots will look at the different categorical variables and how they compare with the interest rate. Another dataframe will be created using only the categorical variable columns.

Each of the categorical variables shows a definite relationship with interest rate. The borrower rate is typically higher when the loan has a negative status. The violin plot for employment status shows that borrowers who are employed or have some other kind of income tend to have a lower interest rate. Homeowners typically have a lower interest rate: the homeowner side of the violin plot has a wider bottom and a narrower top. The box plot of credit scores clearly shows that the higher a borrower's score, the lower the interest rate. Finally, a higher income correlates with a lower interest rate. Overall, the strongest relationship is definitely with credit score.
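The trend a box plot shows across credit score ranges can also be tabulated as a median rate per range. The sketch below is illustrative only: the rows are made up, and apart from the 520-640 range discussed in this analysis, the bin edges and labels are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the loan data (assumed column names).
loans = pd.DataFrame({
    "CreditScore": [540, 600, 650, 700, 720, 780, 810, 630],
    "BorrowerRate": [0.31, 0.28, 0.22, 0.18, 0.16, 0.10, 0.08, 0.27],
})

# Hypothetical bin edges; right=False makes each bin include its lower edge.
edges = [520, 640, 720, 800, 900]
labels = ["520-640", "640-720", "720-800", "800+"]
loans["ScoreRange"] = pd.cut(
    loans["CreditScore"], bins=edges, labels=labels, right=False
)

# Median borrower rate within each credit score range.
median_rate = loans.groupby("ScoreRange", observed=True)["BorrowerRate"].median()
print(median_rate)
```

The median is the center line of each box in a box plot, so a table like this gives the same ordering the plot does, in numbers.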

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The strongest relationship, as suspected, is credit score vs interest rate. The other categorical variables also show some correlation with interest rate. Among the numerical variables, only two show a slight positive correlation with interest rate: current delinquencies and debt-to-income ratio. The original amount of the loan has a negative correlation with interest rate.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I compared the other numerical variables with each other, including those that didn't have a direct correlation with interest rate. All of the following pairs show a negative correlation: current delinquencies vs loan original amount, current delinquencies vs employment status duration, current delinquencies vs debt-to-income ratio, and current delinquencies vs credit lines. Current delinquencies, interest rate, and one of the other four will be compared in multivariate analysis to see if a relationship exists. Comparing the remaining variables with each other did not show any correlation.

Multivariate Exploration

To start with, I want to compare interest rate, current delinquencies, and a third numerical variable to see if a relationship exists.

This shows the relationship between debt-to-income ratio and interest rate. In my opinion, these four scatterplots, which add a third numeric variable (loan original amount), don't really add substance, and none of them will be used. I'll look at another numeric pair whose relationship was established in bivariate analysis. The next sets look at interest rate vs loan original amount, adding in the other four numerical variables.

Again, adding a third variable doesn't reveal much. A small exception might be current delinquencies: there is a small concentration of values between 25% and 30% at a loan amount of around $5,000. These won't be used. Now I want to look at two numerical variables vs a categorical variable.

While most of these relationships are unclear, one surprising thing appears in the credit score range of 520-640: a higher interest rate correlates with longer employment duration. For the most part, I'll exclude these plots despite that one relationship.
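A pattern that appears in only one credit score range can be checked by computing the correlation separately within each range. The sketch below uses a few made-up rows and assumed column names; the first group was constructed to show a positive relationship, so the output illustrates the technique rather than the actual finding.

```python
import pandas as pd

# Made-up rows: employment duration (months) and rate for two score ranges.
loans = pd.DataFrame({
    "ScoreRange": ["520-640"] * 4 + ["720-800"] * 4,
    "EmploymentStatusDuration": [12, 48, 96, 150, 10, 60, 120, 200],
    "BorrowerRate": [0.24, 0.26, 0.29, 0.31, 0.12, 0.13, 0.11, 0.12],
})

# Pearson correlation between duration and rate, computed per score range.
corr_by_range = (
    loans.groupby("ScoreRange")[["EmploymentStatusDuration", "BorrowerRate"]]
    .apply(lambda g: g["EmploymentStatusDuration"].corr(g["BorrowerRate"]))
)
print(corr_by_range.round(2))
```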

The final thing I want to explore is credit score, interest rate, and the other four categorical variables.

These pointplots look like a mess, with the exception of interest rate vs credit score split by homeownership. I'll draw barplots of these to see if the relationships are clearer.

With the exception of borrower rate vs credit score range split by whether the borrower is a homeowner, the other category comparisons are confusing and ambiguous. To simplify some of these bar graphs, I'll combine or get rid of some columns.

Simplifying these bar graphs helps illustrate the relationship between interest rate and credit score. Overall, the negative status items show a higher interest rate than their positive counterparts. Surprisingly, in the lower credit score range (520-640), higher income earners have a slightly higher interest rate compared to their lower income counterparts.
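One plausible way to simplify such bar graphs is to collapse the detailed statuses into two groups and then plot the mean rate per credit score range and group. The grouping, labels, column names, and rows below are all assumptions made for illustration; the notebook may have combined its categories differently.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

# Made-up rows standing in for the loan data.
loans = pd.DataFrame({
    "LoanStatus": ["Current", "Completed", "Chargedoff", "Defaulted",
                   "Past Due", "Current", "Chargedoff", "Completed"],
    "ScoreRange": ["520-640", "520-640", "520-640", "640-720",
                   "640-720", "640-720", "640-720", "640-720"],
    "BorrowerRate": [0.28, 0.26, 0.32, 0.24, 0.22, 0.18, 0.23, 0.16],
})

# Hypothetical mapping that collapses five statuses into two groups.
status_map = {
    "Current": "Positive", "Completed": "Positive",
    "Chargedoff": "Negative", "Defaulted": "Negative", "Past Due": "Negative",
}
loans["StatusGroup"] = loans["LoanStatus"].map(status_map)

# Mean rate for each score range (rows) and status group (columns).
mean_rate = loans.pivot_table(
    index="ScoreRange", columns="StatusGroup",
    values="BorrowerRate", aggfunc="mean",
)

# Grouped bar chart: one cluster per score range, one bar per status group.
ax = mean_rate.plot(kind="bar")
ax.set_ylabel("Mean borrower rate")
```

With fewer hue levels, each cluster has only two bars, which is usually much easier to compare than five or more.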

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I started out comparing three numeric variables, with interest rate being one of the three. For the most part, it was a dead end and did not reveal much. I then tried two numeric variables compared with one categorical variable. The only relationship present was employment duration vs interest rate where the credit score was between 520 and 640. The final sets compared two categorical variables against interest rate. One showed a surprising trend when looking at homeownership as a factor in the comparison of interest rate and credit score. The other bar plots were not too surprising and made sense. I had to simplify three columns to visualize the bar graphs more clearly.

Were there any interesting or surprising interactions between features?

The first interesting trend was a positive correlation between employment duration and interest rate when a borrower's credit score falls between 520 and 640. It shows a higher interest rate the longer a borrower has been employed, which surprised me. The second interesting thing was homeowner status in the comparison of credit score and interest rate. For most of the credit score ranges, the bar plots show a slightly lower interest rate for homeowners, whereas the violin plot showed otherwise.